
RHQ

Request Throttling

Overview

Let's start with the obvious question: why do we want, or need, to throttle requests? The DataStax driver is asynchronous (a blocking API is provided on top of the async API), making it easy to achieve very high levels of request throughput. Without restraint, it is possible to overload Cassandra and/or the driver. It should be pointed out that with a multi-node cluster, client requests are unlikely to overload the entire cluster. RHQ, however, fully supports single-node deployments, and in those scenarios there is a greater chance of overloading that single node. Based on experience, we are far more likely to saturate the driver with requests to the point where they start timing out.

In RHQ 4.9 the only place where async requests are used is the insertion of raw data. Since 4.9 we have been making more use of the async API, particularly with metrics aggregation. In 4.9 aggregation was done entirely with blocking requests, so there was no need for any throttling. The increased use of asynchronous requests makes a robust throttling mechanism all the more important. The throttling was originally implemented with a Semaphore, but that has been replaced with Guava's RateLimiter. If you are unfamiliar with RateLimiter, see this article for a good overview.
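
To make this concrete, here is a minimal sketch of what RateLimiter-based throttling of asynchronous writes could look like. The class and method names, and the example rate of 500 permits per second, are illustrative assumptions, not RHQ's actual implementation.

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.RateLimiter;

    // Minimal sketch: throttle async inserts with a RateLimiter.
    public class ThrottledRawDataWriter {

        private final Session session;

        // 500 writes/sec is an arbitrary example default, not RHQ's actual value.
        private final RateLimiter writePermits = RateLimiter.create(500);

        public ThrottledRawDataWriter(Session session) {
            this.session = session;
        }

        public ResultSetFuture insertRawData(BoundStatement insert) {
            // Blocks the calling thread just long enough to stay under the
            // configured rate, then submits the request asynchronously.
            writePermits.acquire();
            return session.executeAsync(insert);
        }
    }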

There are a few requirements that need to be satisfied.

  • Provide reasonable defaults

  • Be self-tuning

  • Be user-configurable

These requirements are discussed in the following sections. The work for this is being tracked under BZ 1045589.

Reasonable Defaults

While RHQ supports multi-node Storage Node clusters, it also supports single-node clusters. A default installation puts the RHQ Server, Storage Node, and agent all on the same machine. With a default installation, it is also fairly common to have the RHQ relational database residing on the same machine. The default settings for request throughput need to be sensible for a default deployment. For larger deployments and for deployments with an increasing load, higher request throughput becomes necessary. The self-tuning and user-configurable requirements that are discussed next address these concerns. 

Self-Tuning

There are a handful of scenarios in which it makes sense for the throttling to automatically adjust itself.

Storage Node Added

When additional nodes are deployed, the throughput can automatically increase by a configurable delta.

Storage Node Removed

When a storage node is removed from the cluster, the throughput can automatically decrease by a configurable delta.

Storage Node Down

When a storage node goes down, throughput can decrease by a configurable delta.

Storage Node Up

When a previously down storage node comes back up, throughput can increase to previous levels.
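
All four topology events above amount to small adjustments of the RateLimiter's rate. The sketch below is illustrative only; the class name and the way the callbacks would be wired up (for example to the DataStax driver's Host.StateListener or to RHQ's own storage node lifecycle events) are assumptions. Clamping against the minimum threshold is covered under User-Configurable below.

    import com.google.common.util.concurrent.RateLimiter;

    // Sketch of self-tuning in response to cluster topology changes.
    public class TopologyAwareThrottle {

        private final RateLimiter permits;
        private final double delta;          // configurable request throughput delta
        private double rateBeforeNodeDown;   // remembered so "node up" can restore it

        public TopologyAwareThrottle(RateLimiter permits, double delta) {
            this.permits = permits;
            this.delta = delta;
        }

        public synchronized void onNodeAdded() {
            permits.setRate(permits.getRate() + delta);
        }

        public synchronized void onNodeRemoved() {
            // In practice the new rate would be clamped to the minimum request
            // threshold; see the sketch in the User-Configurable section.
            permits.setRate(permits.getRate() - delta);
        }

        public synchronized void onNodeDown() {
            rateBeforeNodeDown = permits.getRate();
            permits.setRate(permits.getRate() - delta);
        }

        public synchronized void onNodeUp() {
            // Restore the throughput that was in effect before the node went down.
            if (rateBeforeNodeDown > 0) {
                permits.setRate(rateBeforeNodeDown);
            }
        }
    }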

Request Timeout

When a request times out, either on the client side or on the server side, throttling should be increased by a configurable delta. It also makes sense to fire an alert in this situation, since a timeout is a very strong indicator that we are not able to keep up with the current load and deploying additional nodes is advised. The topic of alerting will be covered more fully by BZ 1045584.
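
Here is a sketch of how timeouts might be spotted on the async path, assuming the DataStax Java driver and a Guava FutureCallback. The onTimeout hook is a hypothetical callback that would lower the rate and fire the alert, and the exact exception types that signal a client-side timeout vary by driver version.

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.exceptions.ReadTimeoutException;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;
    import com.google.common.util.concurrent.FutureCallback;

    // Sketch: watch async requests for timeouts and notify a throttling hook.
    public class TimeoutWatchingCallback implements FutureCallback<ResultSet> {

        private final Runnable onTimeout; // hypothetical hook: decrease rate, fire alert

        public TimeoutWatchingCallback(Runnable onTimeout) {
            this.onTimeout = onTimeout;
        }

        @Override
        public void onSuccess(ResultSet rows) {
            // Request completed in time; nothing to adjust.
        }

        @Override
        public void onFailure(Throwable t) {
            // Server-side timeouts; client-side timeouts surface as different
            // exception types depending on the driver version.
            if (t instanceof ReadTimeoutException || t instanceof WriteTimeoutException) {
                onTimeout.run();
            }
        }
    }

    // Example wiring:
    // Futures.addCallback(session.executeAsync(insert), new TimeoutWatchingCallback(hook));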

User-Configurable

There are a few things that ought to be configurable so that users can adjust the throttling to their needs when the defaults are inadequate.

Request Throughput Delta

This is the amount by which request throughput should increase or decrease.

Minimum Request Threshold

This is a minimum threshold for request throughput. If throughput continues to decrease, it will not drop below this value. Initially it might seem counter-intuitive to set a minimum threshold, but it is actually quite necessary. If we allow the throughput to continually decrease (as might be the case with successive request timeouts), we could wind up setting the rate to a value less than or equal to zero. This would result in a state in which IllegalArgumentExceptions are continually thrown by RateLimiter.
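
A sketch of how the clamping might look; the field names are illustrative. RateLimiter.setRate rejects non-positive rates with an IllegalArgumentException, so the decreased rate is clamped before it is applied.

    import com.google.common.util.concurrent.RateLimiter;

    // Sketch: never let the rate drop below the configured minimum threshold.
    public class BoundedRateControl {

        private final RateLimiter permits;
        private final double delta;        // request throughput delta
        private final double minimumRate;  // minimum request threshold

        public BoundedRateControl(RateLimiter permits, double delta, double minimumRate) {
            this.permits = permits;
            this.delta = delta;
            this.minimumRate = minimumRate;
        }

        public synchronized void decreaseRate() {
            // Clamp so RateLimiter.setRate never sees a rate <= 0.
            permits.setRate(Math.max(minimumRate, permits.getRate() - delta));
        }
    }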

Timeout Dampening

If and when a request times out, it is likely to be accompanied by other timeouts. Agents send measurement reports to the server, and RHQ inserts the raw data in batches. If a timeout occurs during one of those writes, there is a good chance that more timeouts will follow. If multiple timeouts occur within a configurable window, the throughput should only be decreased once.
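
A sketch of one way the dampening could work, building on the hypothetical BoundedRateControl above: however many timeouts arrive within the configured window, the rate is only decreased once.

    import java.util.concurrent.TimeUnit;

    // Sketch: collapse bursts of timeouts into a single rate decrease per window.
    public class DampenedTimeoutHandler {

        private final BoundedRateControl rateControl;
        private final long windowNanos;   // configurable dampening window
        private long lastDecrease;        // guarded by this
        private boolean decreasedBefore;  // guarded by this

        public DampenedTimeoutHandler(BoundedRateControl rateControl, long window, TimeUnit unit) {
            this.rateControl = rateControl;
            this.windowNanos = unit.toNanos(window);
        }

        public synchronized void onTimeout() {
            long now = System.nanoTime();
            // Only the first timeout in each window triggers a decrease; the rest
            // almost certainly stem from the same burst of overload.
            if (!decreasedBefore || now - lastDecrease >= windowNanos) {
                rateControl.decreaseRate();
                lastDecrease = now;
                decreasedBefore = true;
            }
        }
    }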
